Towards English - Swahili Machine Translation
Abstract
Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language-technological point of view, an unfortunate situation that holds for most, if not all, languages on the continent. The increasing amount of digitally available vernacular data has prompted researchers to investigate the applicability of corpus-based approaches to African language technology. In this vein, the SAWA corpus project attempts to collect and deploy an English - Swahili parallel corpus, not only for the straightforward purpose of developing a machine translation system, but also to investigate whether annotation can be projected into a resource-scarce African language. Compiling a balanced and expansive English - Swahili parallel corpus is a rather daunting task. While monolingual Swahili data is abundantly available on the Internet, sourcing parallel texts is cumbersome. Even countries that have both English and Swahili as official languages, such as Tanzania, Kenya and Uganda, do not tend to translate and/or publish all government documents bilingually. One therefore opportunistically collects whatever can be found in the public domain. At this point in the data collection phase, that means that the 2.2 million word parallel corpus is biased towards religious material, such as Bible and Quran translations. Nevertheless, the more interesting, secular part of the SAWA corpus (± 420k words) is steadily increasing, thanks to the inclusion of bilingual investment reports, manually translated movie subtitles, political documents and material kindly donated by local translators to the SAWA project. Each text in the SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw and de Schryver (2008) for Swahili.
These extra annotation layers allow us to perform more accurate automatic word alignment on the basis of factored data.
Table 1: Precision, recall and F-score (β = 1) for the word-alignment task using GIZA++.
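The metrics named in Table 1 can be illustrated with a minimal sketch. It assumes that an alignment is represented as a set of (source index, target index) links and that a predicted alignment is scored by plain set overlap with a gold-standard alignment; the function name `alignment_scores` and the toy links are hypothetical, and the evaluation protocol actually used for the SAWA corpus may differ (for example, by distinguishing sure and possible links).

```python
def alignment_scores(predicted, gold, beta=1.0):
    """Return (precision, recall, F_beta) for two sets of alignment links.

    Each link is a (source_token_index, target_token_index) pair.
    This is an illustrative link-level comparison, not the paper's protocol.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links present in both alignments

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Toy example (invented links, not SAWA data).
gold_links = {(0, 0), (1, 1), (2, 2), (3, 2)}
predicted_links = {(0, 0), (1, 1), (2, 3)}

p, r, f = alignment_scores(predicted_links, gold_links)
print(f"Prec. {p:.3f}  Recall {r:.3f}  F(beta=1) {f:.3f}")
```

With β = 1 the F-score is the harmonic mean of precision and recall, so an aligner is rewarded only when it proposes few spurious links and also misses few gold links.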
Similar resources
SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
Developing Named Entity Recognition (NER) for a new language using standard techniques requires collecting and annotating large training resources, which is costly and time-consuming. Consequently, for many widely spoken languages such as Swahili, there are no freely available NER systems. We present here a new technique to perform NER for new languages using online machine translation systems....
The SAWA Corpus: A Parallel Corpus English - Swahili
Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the sour...
Exploring the SAWA corpus: collection and deployment of a parallel corpus English - Swahili
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English—Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily access...
ATLAS: Fujitsu machine translation system
1. Introduction. In 1984 Fujitsu marketed the automatic machine translation systems ATLAS-I and ATLAS II. ATLAS-I was the world's first commercial English-Japanese translation system. Fujitsu is also conducting a joint project on research and development of a Japanese-Korean machine translation system in cooperation with Korean Advanced Institute of Science and Technology. ATLAS II aims at achi...
A note on the translation of Swahili into English
Some features of the morphology of Swahili are discussed from the point of view of mechanizing a dictionary. A preliminary program is described. To the best of my knowledge, no work has previously been carried out on the mechanical translation of any Bantu language. This note is therefore a first suggestion of a possible basis for a scheme for the mechanical translation of Swahili into English....